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Abstract 

Background: To complement next-generation sequencing technologies, there is a pressing need for efficient pre- 
sequencing capture methods with reduced costs and DNA requirement. The Alu family of short interspersed 
nucleotide elements is the most abundant type of transposable elements in the human genome and a recognized 
source of genome instability. With over one million Alu elements distributed throughout the genome, they are 
well positioned to facilitate genome-wide sequence amplification and capture of regions likely to harbor genetic 
variation hotspots of biological relevance. 

Results: Here we report on the use of inter-Alu PCR with an enhanced range of amplicons in conjunction with 
next-generation sequencing to generate an Alu-anchored scan, or 'AluScan', of DNA sequences between Alu 
transposons, where Alu consensus sequence-based 'H-type' PCR primers that elongate outward from the head of 
an Alu element are combined with T-type' primers elongating from the poly-A containing tail to achieve huge 
amplicon range. To illustrate the method, glioma DNA was compared with white blood cell control DNA of the 
same patient by means of AluScan. The over 10 Mb sequences obtained, derived from more than 8,000 genes 
spread over all the chromosomes, revealed a highly reproducible capture of genomic sequences enriched in genie 
sequences and cancer candidate gene regions. Requiring only sub-micrograms of sample DNA, the power of 
AluScan as a discovery tool for genetic variations was demonstrated by the identification of 357 instances of loss 
of heterozygosity, 341 somatic indels, 274 somatic SNVs, and seven potential somatic SNV hotspots between 
control and glioma DNA. 

Conclusions: AluScan, implemented with just a small number of H-type and T-type inter-Alu PCR primers, provides 
an effective capture of a diversity of genome-wide sequences for analysis. The method, by enabling an 
examination of gene-enriched regions containing exons, introns, and intergenic sequences with modest capture 
and sequencing costs, computation workload and DNA sample requirement is particularly well suited for 
accelerating the discovery of somatic mutations, as well as analysis of disease-predisposing germline 
polymorphisms, by making possible the comparative genome-wide scanning of DNA sequences from large human 
cohorts. 
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Background 

Next-generation, massively-parallel sequencing technolo- 
gies have transformed the landscape of genetics through 
their ability to produce giga-bases of sequence informa- 
tion in a single run. However, the sequencing cost, com- 
putation workload and amount of sample DNA required 
are still too high for large scale population analysis by 
means of whole-genome sequencing. There is clearly a 
need for pre-sequencing capture of subsets of the gen- 
ome in order to reduce these requirements. Although 
the whole exome represents a valuable subset, its exclu- 
sion of introns, and the high cost and high DNA 
requirement for its analysis, remain major limitations. 
Other sequence subsets therefore clearly need to be 
explored. 

Alu-transposons are a family of primate-specific short 
interspersed nucleotide elements (SINE) of ~ 300 bp 
derived from 7SL RNA [1]. Although Alu elements were 
once considered as 'junk DNA', their biological impor- 
tance, in particular their influence on genome instability 
is being increasingly recognized [2,3]. They are abun- 
dant in gene-rich regions [4,5], exert a major impact on 
genomic architecture [6], and increase local recombina- 
tion rates [7]. Previously we have found enhanced SNP 
frequencies in the vicinity of Alu-elements [8], more so 
among the youngest AluY elements than the intermedi- 
ate-age AluS and the oldest AluJ. AluYs display also a 
higher rate of methylation, consistent with a stronger 
silencing pressure on these elements [9]. Genotypic var- 
iations surrounding a human lineage-specific AluY 
insertion in the GABRB2 gene encoding GABA A recep- 
tor p 2 subunit have been found by us to constitute a 
joint focal point for positive evolutionary selection [10], 
hotspot recombinations [11] as well as association with 
schizophrenia and bipolar disorder [12,13]. Neighbor- 
hoods of Alu-transposons are therefore a highly signifi- 
cant sequence subset of the human genome in terms of 
evolutionary development and pathogenesis. 

Inter- Alu PCR is a useful method for isolating human 
DNA in the presence of animal DNA [14], linkage map- 
ping [15], creation of human specific probes and finger- 
prints [16], and detection of mutator phenotypes [17] or 
high frequency genetic alterations [18]. The general 
strategy of the method is to employ a single PCR primer 
based on the Alu consensus sequence to amplify the 
sequence between two Alu elements. With well over a 
million Alu-transposons in the human genome, the 
average distance between two Alus is only 2.4 kb (Figure 
1A), which suggests that inter-Alu PCR with an 
enhanced amplicon range coupled to next-generation 
sequencing could yield a huge sequence subset of the 
human genome for analysis. Accordingly the objective 
of the present study is to examine the possibility of 
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Figure 1 Inter-Alu sequences in the human genome. (A) Length 
distribution of inter-Alu distances between two adjacent Alu- 
transposons in the reference human genome. Heights of bars 
represent number of adjacent Alu pairs in the human genome at 
different inter-Alu distances. Subtraction of empty column at < 200 
bp representing short inter-Alu sequences that were removed from 
analysis during library construction for next generation sequencing 
leaves the solid columns of inter-Alu sequences of varying lengths 
capturable from the genome. The total sum of the solid columns 
up to an inter-Alu distance of 6 kb, 8 kb or 10 kb equals 1.10 Gb, 
1.36 Gb or 1.58 Gb, respectively. (B) Inter-Alu sequences of 0.2-6.0 kb 
in length that are in principle capturable by the three Alu- 
consensual primers AluY278T18, AluY66H21 and R12A/267. Such 
capturable amplicons amount to -14 Mb if no mismatch is allowed 
between the consensual primers and template sequences, or -106 
Mb if one mismatch is allowed per primer. The graph shows the 
length distribution of the latter 106 Mb. 



enhancing the amplicon range of inter-Alu PCR and 
combining it with next generation sequencing to scan 
for sequence and structure variations in the human 
genome. 

Results 

Individual Alu-transposons in the human genome are 
on the average only 15 - 20% divergent from each other, 
and PCR primers complementary to the Alu consensus 



Mei et al. BMC Genomics 201 1, 12:564 
http://www.biomedcentral.eom/1 471 -21 64/1 2/564 



Page 3 of 9 



sequence have been employed for inter-Alu PCR 
[14-18]. Likewise PCR primers based on consensus 
sequences in the AluJ, AluS and AluY subfamilies could 
also be devised. All Alu-based primers can be divided 
into 'H-type' where the primer extends outward from 
the head of the Alu, or 'T-type' where it extends out- 
ward from the poly-A containing tail. Previously, single 
general Alu consensus primers had given rise to agarose 
gel electrophoretograms displaying largely banded, 
banded plus smeared, or largely smeared patterns 
[14-18]. In the present study, varying combinations of 
Alu, AluJ, AluS and/or AluY consensus primers were 
found to yield widely different electrophoretogram pat- 
terns. The presence of a single H-type or T-type primer 
tended to yield a banded, non-smeared pattern sugges- 
tive of a limited amplicon range (lanes A-D, Figure 2). 
In lanes I and L respectively, even two or three T-type 
primers failed to give a non-banded pattern; lane K with 
three H-type primers gave a smeared pattern but lane F 
with two H-type primers gave only a banded pattern. In 
contrast, various primer combinations containing both 
H-type and T-type primers, allowing the amplification 
of intervening sequences between two Alu heads, 
between two Alu-tails as well as between one head and 
one tail, readily yielded a smeared gel indicating the pre- 
sence of a wide diversity of amplicons of different sizes 
(lanes E, G, H, J, P-V). Therefore inclusion of both H- 
type and T-type primers provided the most reliable 
method for achieving huge amplicon range using no 
more than a small number of primers. The greater 



staining intensity of lane S compared to lane R further 
showed that the amounts of amplicons obtained from 
the same primer set could be increased by increasing 
the primer concentrations. 

When AluScans were performed on paired control 
and cancer DNAs extracted from respectively the white 
blood cells and glioma tissue of a male Han Chinese 
patient using the three primers AluY278T18, 
AluY66H21 and R12A/267 described under Methods, 
smeared gels of amplicons up to ~ 6 kb in size were 
obtained (Figure 2, lane Q for control DNA). In each 
case the use of 90 ng sample DNA yielded sufficient 
amplicons for next-generation sequencing on the Illu- 
mina platform with a single flowcell lane and 75 bp 
paired-end reads. The sequencing output has been sub- 
mitted to Sequence Read Archive (SRA) of NCBI. As 
indicated in Table 1, 837 Mb of the initial reads of con- 
trol white blood cell DNA were mapped using the BWA 
program [19] to 58.9 Mb regions on the reference 
human genome (GRCh37.p2), including high quality 
mapping of 717 Mb reads to 10.6 Mb regions with mini- 
mum 10 times and average 67 times coverage. Of the 
latter 10.6 Mb, 95% were inter-Alu sequences, which 
compared favorably with the NimbleGen SeqCap Exome 
array for targeted exon capture with typically 71% 
mapped reads on target [20]; 53% were genie sequences 
including both exons and introns from 8,502 genes, 
representing an enrichment of genie sequences com- 
pared to the overall 40% gene content of the human 
genome; and 34% of the genes belonged to the list of 
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Figure 2 Amplicon range of inter-Alu PCR. The different lanes show gel electrophoretograms of amplicons obtained using different single 
primer or multiple primer sets: (A) AluJo56H16 alone; (B) AluJo232T16 alone; (C) AluSq56H16 alone; (D) AluSq263T16 alone; (E) AluJo56H16 and 
AluJo232T16; (F) AluJo56H16 and AluSq56H16; (G) AluJo56H16 and AluSq263T16; (H) AluJo232T16 and AluSq56H16; (I) AluJo232T16 and 
AluSq263T16; (J) AluSq56H16 and AluSq263T16; (K) AluJo56H16, AluSq56H16 and H-type L12A/8; (L) AluJo232T16, AluSq263T16 and T-type R12A/ 
267, (N) AluY278T18, (P) AluY278T18 and AluY66H21, (Q)-(S) AluY278T18, AluY66H21 and R12A/267, CO AluJo56H16, AluSq56H16 and 
AluSq263T16, (U) AluJo56H16, AluSq263T16 and AluJo232T16, (V) AluJo56H16, AluSq263T16 and L12A/8, and (M) M.W. markers. Notably, lanes E, 
G, H, J, P-V, where the inter-Alu PCR was performed using both H-type and T-type primers, gave rise to a largely smeared gel. Comparison of 
lanes N and P showed the conversion of a banded pattern obtained using a single T-type primer to a smeared one through the addition of an 
H-type primer. The much stronger staining intensity of lane S relative to lane R showed that the 0.30 uM concentrations of the same three 
primers in S compared to 0.10 uM in R resulted in increased amounts of amplicons. The primers AluJo56H16 (5'-GGCTCAAGCGATCCTC-3'), 
AluJo232T16 (5'-TATGATCGTGCCACTG-3') ; AluSq56H16 (5'ACCTCAGGTGATCCAC-3'), and AluSq263T16 (5'-AACAAGAGCGAAACTC-3') were based on 
AluJo and AluSq consensus sequences [20]. H-type L12A/8 was an Alu consensus primer suggested earlier for inter-Alu PCR [15]. AluY278T18, 
AluY66H21 and T-type R12A/67 are described in Methods. 
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Table 1 Control and glioma sequence outputs from AluScan 





Control 


Glioma 


Total initial reads (bp) 


896,661,780 


1 ,065,234,324 


Initial reads (bp) mapped to GRCh37.p2 


836,745,196 


983,798,813 


Genomic regions mapped (bp) with coverage > 1 


58,876,070 


64,297,421 


Reads mapped (bp) with coverage > 10 


717,228,580 


853,715,563 


Genomic regions mapped with coverage > 10: 
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Average depth (x) 


67.4 


72.1 
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1 1 77S 


Genie sequences mapped (bp) 


5,644,023 


6,274,433 


Genie sequences mapped as % of total mapped regions 


53.1% 


53.0% 


Excess in genie sequences 3 


13.1% 


13.0% 


Number of genes with mapped region 


8,502 


9,205 


Number of cancer candidate genes b with mapped region 


2,921 


3,096 


Cancer candidate genes b with mapped region as 


34.4% 


33.6% 


% of genes with mapped region 






Excess in cancer candidate genes c 


8.4% 


7.6% 



a Excess in % genie sequences with mapped region relative to the 40% genie sequences in the reference human genome. 
b Cancer candidate genes refer to genes listed in Gene Ranker: TCGA GBM 6000. 

c Excess in % cancer candidate genes with mapped region relative to the 26% of total human genes included in Gene Ranker: TCGA GBM 6000. 



cancer candidate genes in Gene Ranker: TCGA GBM 
6000, exceeding the 26% of all human genes included in 
that list. The genomic regions mapped by the reads fol- 
lowed closely the number of Alu transposons located on 
the chromosomes (Figure 3). With glioma DNA, 984 
Mb of the initial reads were mapped to 64.3 Mb geno- 
mic regions, including 11.8 Mb high quality regions 
with minimum 10 times and average 72 times coverage. 
The overlap between the high quality regions mapped 
by the control- and glioma-reads totalled 9.5 Mb, equal 
to 90% of the control-mapped regions; the correlation in 
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Figure 3 Correlation between number of Alu elements and 
sequences captured. Relationship between the number of Alu 
elements occurring on individual human chromosomes (blue bars) 
and the amount of AluScan-captured sequences of white blood cell 
DNA mapped to the different chromosomes (red triangles). 
Correlation coefficient was 0.939 between these quantities. 

v / 



read coverage between control and glioma reads was 
high, with r = 0.958; the density distributions of the 
control and glioma reads along different chromosomes 
(Figure 4) were also highly correlated, with r = 0.957. 
These results provided evidence that AluScan performed 
with the same set of primers enabled a reproducible 
genome-wide capture of DNA sequences that were 
enriched in both genie content and cancer candidate 
genes despite the many overlapping inter- Alu amplicons 
that might be amplified by a mixture of H- and T-type 
primers. 

Figure 5 and Additional File 1 show the distributions 
of genetic variations occurring in the 10.6 Mb control 
genomic sequences relative to reference human genome 
among different chromosomes and types of genomic 
regions: there were 18,506 germline SNVs, 11,039 or 
59.6% of which were novel SNVs absent from 
dbSNP132, 2,108 small (< 30 bp) indels and two larger 
indels, viz. a 75-bp deletion on chromosome 19 and a 
767-bp deletion on chromosome 3. When 60 SNV-con- 
taining genie segments including 10 segments each con- 
taining a novel SNV were randomly chosen from the 
control AluScan output for Sanger sequencing, the accu- 
racy of successful SNV verification was 81.6%. 

Comparison of the mapped control and glioma 
sequences identified 274 somatic SNVs between them, 
70.4% of which represented novel SNVs absent from 
dbSNP132. In the control and glioma SNVs relative to 
the reference human genome, as well as the somatic 
SNVs occurring between control and glioma, transitions 
were far more numerous than transversions (Additional 
File 2). There were 357 instances of loss of heterozygosity 
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Figure 4 Density distributions of captured sequences mapped 
on chromosomes. The lengths of genomic sequences (blue or red 
peaks) mapped within 10-kb non-overlapping windows were 
plotted along chromosomes 1-22, X and Y using Human Genome 
Graphs (http://genome.ucsc.edu/cgi-bin/hgGenome) for (A) white 
blood cell DNA and (B) glioma DNA. 



(LOH) and 341 somatic indels between control and 
glioma DNAs. The LOHs were unequally distributed 
among different chromosomes (Additional File 3). Of the 
four particularly LOH-enriched regions, viz. regions lp, 
9p, 9q and 19q indicated in Figure 6, notably lp and 19q 
were known to contain glioma-associated deletions [21], 
which furnished valuable cross validation between AluS- 
can and other genomic approaches. 
Seven 5-Mb intervals in the glioma sequences displayed 
enhanced numbers of somatic SNVs, where the number 
of somatic SNVs > 4, indicating the potential presence 
of somatic SNV hotspots (Figure 6). Of these seven 
potential SNV hotspots, those in chromosomal regions 
12ql3, 17q21, 18pll, 19pl3 and 19ql3 harboured alto- 
gether 16 SNV-containing genes including RAB5C of 
the RAS oncogene family in 17q21 (Additional File 4). 
None of these 16 genes were included in OMIM as a 
known glioma-associated gene. These findings illustrated 
the usefulness of AluScan as a discovery tool. 

Discussion 

Using only 90 ng sample DNA in each instance, the 
AluScans performed in the present study with one H- 



type and two T-type primers generated reads that cov- 
ered a total of -58-64 Mb, or -1.9-2.1% of genomic 
sequences. This total was comparable in order of magni- 
tude to the genomic sequences in principle capturable 
by the set of three H and T-type consensual Alu-based 
primers employed, which were estimated to be -14 Mb 
for exact primer-template matches, or -106 Mb allow- 
ing for one mismatched base-pair per primer (Figure 
IB), but still far below the total of 1.10 Gb inter-Alu 
regions of < 6 kb in length in the human genome (Fig- 
ure 1A). Thus there could be ample room for widening 
the scope of AluScan-capturable sequences through the 
use of diverse combinations of H- and T-type primers. 
Primers specific for other transposable elements such as 
LINEs, LTRs, as well as other types of more specialized 
primers could also be utilized to tailor the AluScan cap- 
ture to a given investigational goal. Moreover, by treat- 
ing target DNA with bisulfite to modify unmethylated 
C-residues prior to AluScan, epigenomic changes in 
normal and diseased cells may also be monitored. 

By combining the twin advantages of multitudinous 
amplification of inter-Alu sequences through the joint 
usage of H-type and T-type primers, and massively par- 
allel next-generation sequencing, AluScan thus provides 
a new method for genome-wide investigation in addition 
to whole genome sequencing (WGS) and whole exome 
sequencing (WES). WGS is the standard in comprehen- 
siveness, but incurs high operation cost, large computa- 
tion workload and multi-microgram DNA requirement. 
WES provides integral insight into the entire exome, but 
leaves the intronic regions uncharacterized, besides 
incurring high capture cost and multi-microgram DNA 
requirement. AluScan permits an examination of gene- 
enriched segments of exons, introns and intergenic 
sequences requiring comparatively modest capture and 
sequencing costs, lighter computation workload and 
only sub-microgram DNA samples. These three meth- 
ods complement one another, together making possible 
a comprehensive analysis of sequence and structure var- 
iations of the human genome. 

Conclusions 

AluScan implemented with just a small number of PCR 
primers based on consensus Alu sequences provides a 
multiplex method for genome-wide sequence analysis. 
Through the inclusion of H and T type primers, the 
approach employs the abundance and wide distribution 
of Alu elements in the human genome as the basis for 
the effective capture of a huge number of DNA 
sequences in the vicinity of Alu elements. As demon- 
strated by the strong correlation between the captured 
white blood cell and glioma sequences, the same set of 
H and T-type primers has led to an extensively reprodu- 
cible subset of genomic sequences in the two separate 
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Figure 5 Genetic variations identified in the white blood cell genome relative to the reference human genome. From the outside, Tracks 
1 and 2: numbering and cytobands of different chromosomes. Track 3: red dots showing number of SNVs per 1 Mb window (scale: 0-60, 1 div = 
12). Track 4: blue dots showing number of indels of < 30 bp per 1 Mb window (scale: 0-12, 1 div = 2.4). Inner circle: green bars showing 
positions of structural variations in the form of two large indels at chromosomes 3 and 19. The figure was drawn using the Circos program [29]. 



AluScans. As well, at least for this set of H and T-type 
primers, the captured sequences were enriched in genie 
and cancer-related DNA sequences. 

The results in Figure 6 illustrate the utility of AluScan 
as a discovery tool Comparison of the paired while blood 
cell-glioma DNAs of a single patient has led to the unco- 
vering of 357 LOHs and 274 somatic SNVs, a majority of 
which likely arising in the glioma, and seven potential 
SNV hotspots located on six different chromosomes. 



Importantly, the modest technical cost and DNA sample 
size required for AluScan will render practicable a follow 
up with similarly paired AluScans for tens to hundreds of 
glioma patients in order to distinguish the somatic and 
germline driver mutations fundamental to the develop- 
ment of the disease from passenger mutations. A major 
application of AluScan will thus reside in its facilitation 
of large cohort studies for clinical and biological investi- 
gations of the human genome. 
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Figure 6 Somatic mutations in glioma cells. From the outsider, Tracks 1 and 2: numbering and cytobands of different chromosomes. Track 3: 
black bars showing the LOH-enriched loci in chromosomal regions, most prominently in 1p, 9p, 9q and 19q. Track 4: blue bars showing number 
of somatic indels of < 30 bp per 5-Mb window (scale: 0-6). Track 5: red bars showing number of somatic SNVs per 5-Mb window (scale: 0-8). The 
cytogenetic locations of seven potential SNV hotspots displaying a density of somatic SNVs > 4, viz. greater than the sum of the Mean (= 1.5) 
plus two-times the standard deviation (= 1.9), per 5-Mb window are indicated as 1 1 q 1 3.4, 12q13, 16p13.1 etc. 



Methods 

DNA samples 

Paired blood and cancer samples were obtained with 
consent and institutional ethics approval from a male 
Chinese Han patient with anaplastic oligodendroglioma 
at Beijing Tiantan Hospital for the preparation of con- 
trol DNA by phenol-chloroform extraction and cancer 
genomic DNA using the AllPrep kit (Qiagen). 



Inter-Alu PCR and next-generation sequencing 

Fifteen parallel 25-[il PCR reaction mixtures each con- 
taining 2 (il Bioline lOx NH 4 buffer (160 mM ammo- 
nium sulfate, 670 mM Tris-HCl, pH 8.8, 0.1% stabilizer), 
3 mM MgCl 2 , 0.15 mM dNTP mix, 0.3 [iM 
AluY278T18 primer, 0.18 \iM AluY66H21 primer, 0.06 
[iM R12A/267 primer, 1 unit Bioline Taq polymerase, 
and 6 ng control or glioma DNA. PCR amplification for 
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AluScan included DNA denaturation at 95°C for 5 min, 
followed by 35 cycles each of 30 s at 95°C, 30 s at 54°C, 
and 5 min at 71°C, and finally another 5 min at 71°C. 
Amplicons were purified with ethanol precipitation, and 
> 3 (ig purified products per sample were employed for 
Illumina GAII library construction and sequencing at 
Beijing Genomics Institute (Shenzhen, China). 
AluY278T18 (S'-GAGCGAGACTCCGTCTCA-S'), 
where 'AluY' represents the subfamily, '278' the first 
position on the AluY consensus sequence paired with 
the primer, T a 'Tail-type' primer (vs. 'H' for 'H-type'), 
and '18' the length of the primer, and AluY66H21 (5'- 
rGGTCTCGATCTCCTGACCTC-S') were AluY con- 
sensus primers [22,23]. R12 A/267 (T-type) was an Alu 
consensus primer employed earlier for inter-Alu PCR at 
an annealing temperature of 56°C [18]. 

Agarose gel electrophoresis 

PCR was performed basically as described in the preced- 
ing section, except that one PCR tube of 20 (il contain- 
ing 100 ng control DNA was employed. The annealing 
temperatures were chosen to maximize in each instance 
the yield of amplicons: 60°C for lane A in Figure 2, 58°C 
for B-D, 56°C for H and L, 64°C for N, and 54°C for the 
other lanes. Primer concentration was 0.30 \iM for the 
single-primer lanes A-D; 0.15 (iM per primer for the 
two-primer lanes E-J; 0.10 (iM per primer for the three- 
primer lanes K, L, R, T-V; 0.30 \iM per primer for the 
triple-dosed lane S. The concentrations of primers 
AluY278T18, AluY66H21 and R12A/267 in lane Q were 
0.375 (iM, 0.225 (iM, 0.075 (iM respectively; lane P was 
same as lane Q with omission of R12A/267; and lane N 
was same as lane P with further omission of 
AluY66H21. 

Read mapping and variant analysis 

Sequence reads were mapped to the GRCh37.p2 refer- 
ence human genome using BWA (bwa-short algorithm 
version 0.5.9rcl) with default settings [19]. Initial map- 
ping results were transferred into indexed and sorted 
BAM format using SAMtools version 0.1.12a [24], and 
further recalibrated and locally realigned using the Gen- 
ome Analysis Toolkit (GATK version 1.0.4905) software 
[25]. Regions with read depths of < lOx were not ana- 
lyzed further. 

The UnifiedGenotyper module in GATK was used to 
produce the primary SNV calls, which were filtered 
using the parameter '-stand_call_conf 50.0' and the Var- 
iant Filtration module, ensuring a coverage depth > 10 
x, mapping quality > 25.0 and strand bias < 0. SNVs in 
the vicinity of indels were removed by means of the 
IndelGenotyperV2 module. Further filtration was 
achieved using the criterion that homozygous reference 
loci have a non-reference read frequency of < 10%, 



heterozygous SNVs have a non-reference read frequency 
of >10% and <85%, and homozygous non-reference 
SNVs have a non-reference read frequency of > 85%. 
Small indels were called using mpileup with '-ugf and 
bcftools with '-bvcg' in SAMtools; and the calls were fil- 
tered using the script vcfutils.pl in SAMtools with 
default settings. Structural variants were identified initi- 
ally using BreakDancer version 1.1 [26] and refined 
using Pindel version 0.20 [27]. Somatic SNVs were 
defined as heterozygous loci present in the tumor gen- 
ome that corresponded to homozygous loci in the con- 
trol genome, and LOH SNVs were defined as 
heterozygous loci present in the control genome that 
corresponded to homozygous loci in the tumor genome. 
Novel somatic SNVs were obtained by removing all 
LOHs and those SNVs already reported in dbSNP132. 
LOHs were identified by comparison between control 
and glioma reads using ExomeCNV version 1.23.0 [28]. 

Additional material 



Additional File 1: Distributions of SNVs and indels among different 
genomic regions. Header: Control SNV = SNVs in control DNA relative 
to reference human genome GRCh37. Glioma SNV = SNVs in glioma 
DNA relative to reference human genome. Somatic SNV = SNVs between 
control and glioma DNAs. Control Indel = indels in control DNA relative 
to reference human genome. Glioma Indel = indels in glioma DNA 
relative to reference human genome. Somatic Indel = indels between 
control and glioma DNAs. LOH SNV = LOHs between control and glioma 
DNAs. 

Additional File 2: Distribution of SNVs among different classes of 
nucleotidyl changes. X-axis shows six different classes of possible 
nucleotidyl change in SNV, and Y-axis shows percentage of each class. 
Columns represent SNVs in control DNA relative to reference human 
genome (blue), in glioma DNA relative to reference human genome 
(red), and between the paired control and glioma DNAs (green). 

Additional File 3: Distribution of LOHs on different chromosomes. 

LOH regions are indicated in red, Non-LOH regions in yellow and 
unmapped regions in black, on horizon line in the diagram for each 
chromosome. Grey dots represent frequencies of non-reference alleles, 
found in either control or glioma SNVs that were not represented in the 
reference human genome. X axis shows position along each 
chromosome, and Y axis the non-reference allele frequency. 

Additional File 4: Genie regions in potential SNV hotspots revealed 
by AluScans. Header: Potential Hotspot = chromosomal location of each 
potential SNV hotspot shown on Figure 6. SNV position = positions of 
different genie SNVs in an indicated hotspot. Region = location of a 
genie SNV in a particular gene shown in 'Gene' column. Gene = name of 
gene in an indicated hotspot containing an SNV. 
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